From Adaboost to XGBoost
Dublin Data Science Meetup | 17th May 2017
The headlines
Regarding Adaboost
"For many classification algorithms, this simple strategy results in dramatic improvements. We show that this seemingly mysterious phenomenon can be understood in terms of well know statistical principles…"
Gradient Boosting is introduced in this paper
Boosting
The idea of boosting goes back at least to Schapire's 1990 paper The Strength of Weak Learnability
Adaboost algorithm introduced in A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting (Freund & Schapire 1995)
The idea is to produce a strong learner by combining (adding together) a number of weak learners
A learner is any classification (or regression) rule
For a simple 2 class { -1,1 } classification problem, a Boosted Classifier has the form
\[
G(x) = \alpha_{1}G_{1}(x) + \alpha_{2}G_{2}(x) + \dots + \alpha_{M}G_{M}(x)
\]
combining \(M\) simple classifiers \(G_m\) with weights (\(\alpha_m\)) to create a majority predictor \(G\).
Predict \(1\) if \(G(x) > 0\), otherwise \(-1\).
Freund & Schapire's approach was to fit the ensemble model in a stagewise fashion (one-by-one)
Each stage would concentrate on cases that have been, up to that point, classified incorrectly (adaptive)
The weight of each new learner would depend on how effective it is
For m = 1 to M:
Fit a classifier \[h_m(x) \in \{-1,1\}\]
Weighted error \[e_m = E_{w}\left[1_{(y\neq h_m(x))}\right]\]
Model weight \[\alpha_{m} = \frac{1}{2}\ln\left(\frac{1-e_{m}}{e_{m}}\right)\]
Case reweighting \[ w_{m+1} = w_m\exp\left(\alpha_{m}\cdot 1_{(y_i\neq h_m(x_i))}\right) \]
Update \[H_m(x) = H_{m-1}(x) + \alpha_m h_m(x)\]
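The loop above can be sketched in a few lines. A minimal runnable illustration (in Python/NumPy rather than R, with exhaustive one-feature threshold stumps as the weak learners; all function and variable names are illustrative, not from any particular package):

```python
import numpy as np

def fit_stump(X, y, w):
    """Pick the single-feature threshold stump minimising weighted error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= t, -sign, sign)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def adaboost(X, y, M=10):
    """y in {-1, +1}. Returns a list of (alpha, feature, threshold, sign)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start with uniform case weights
    ensemble = []
    for m in range(M):
        err, j, t, sign = fit_stump(X, y, w)
        e_m = max(err / w.sum(), 1e-10)          # weighted error rate e_m
        alpha = 0.5 * np.log((1 - e_m) / e_m)    # model weight
        pred = np.where(X[:, j] <= t, -sign, sign)
        w = w * np.exp(alpha * (pred != y))      # upweight misclassified cases
        w = w / w.sum()
        ensemble.append((alpha, j, t, sign))
    return ensemble

def predict(ensemble, X):
    """Sign of the weighted vote of the stumps."""
    score = np.zeros(len(X))
    for alpha, j, t, sign in ensemble:
        score += alpha * np.where(X[:, j] <= t, -sign, sign)
    return np.where(score > 0, 1, -1)
```
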
Statistical view of boosting
Gradient descent steps against the derivative of the objective; boosting applies the same idea in function space, treating the model itself as the parameter:
\[x^{(m+1)} = x^{(m)} - \Delta \frac{df(x^{(m)})}{dx}\]
\[f_{m+1}(x) = f_{m}(x) - \Delta \frac{d\ell}{df_m} \]
Squared-Error loss function \[ \ell(x_{i},y_{i},f) \propto \frac{1}{2}(y_i-f(x_i))^2 \]
Then the derivative \[ \frac{d\ell}{df} = -(y_i-f(x_i)) = -r_i \] so for squared-error loss the negative gradient is exactly the current residual: the update adds a function that maps \(x_i\) to \(r_i\).
We find the function from our basis set (trees, linear models, etc.) which brings us as close as possible to returning the residuals (i.e. we model the residuals)
We then update the model, descending the loss gradient: \[f_m(x) = f_{m-1}(x) + \alpha_m b_m(x)\]
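The two steps (fit the residuals, take a step) can be sketched directly. A minimal Python/NumPy illustration using one-split regression stumps as the basis functions (a sketch under assumed names, not any package's actual implementation):

```python
import numpy as np

def fit_regression_stump(X, y):
    """One-split regression tree: threshold on one feature, mean in each leaf."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or (~left).all():
                continue                          # skip splits with an empty side
            pred = np.where(left, y[left].mean(), y[~left].mean())
            sse = np.sum((y - pred) ** 2)
            if best is None or sse < best[0]:
                best = (sse, j, t, y[left].mean(), y[~left].mean())
    _, j, t, lo, hi = best
    return lambda Z, j=j, t=t, lo=lo, hi=hi: np.where(Z[:, j] <= t, lo, hi)

def gradient_boost(X, y, M=50, lr=0.1):
    """Squared-error gradient boosting: each new stump models the residuals."""
    f0 = y.mean()                        # start from the constant model
    pred = np.full(len(y), f0)
    stumps = []
    for m in range(M):
        r = y - pred                     # negative gradient = current residuals
        b = fit_regression_stump(X, r)   # weak learner fit to the residuals
        pred += lr * b(X)                # descend the loss gradient
        stumps.append(b)
    return f0, stumps

def gb_predict(model, X, lr=0.1):
    f0, stumps = model
    return f0 + lr * sum(b(X) for b in stumps)
```

The learning rate `lr` plays the role of the step size \(\Delta\) above; shrinking it and adding more rounds is the usual regularisation trade.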
Other losses give other pseudo-residuals (negative gradients):
Poisson: \[y_i - \exp(f(x_i))\]
Bernoulli: \[ y_i - \frac{1}{1 + \exp(-f(x_i))} \]
Exponential loss (what Adaboost uses): \[ -2(y_i-1)\exp(-2(y_i-1)f(x_i)) \]
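In code, each stage simply fits the next learner to whichever negative gradient the loss dictates. A small illustrative Python helper mirroring these formulas (names are assumed; the `"adaboost"` case transcribes the exponential-loss gradient exactly as written on the slide):

```python
import numpy as np

def pseudo_residuals(y, f, loss):
    """Negative gradient of the chosen loss at the current scores f."""
    if loss == "squared":                # y_i - f(x_i)
        return y - f
    if loss == "poisson":                # y_i - exp(f(x_i))
        return y - np.exp(f)
    if loss == "bernoulli":              # y_i - sigmoid(f(x_i))
        return y - 1.0 / (1.0 + np.exp(-f))
    if loss == "adaboost":               # exponential-loss gradient, as above
        return -2 * (y - 1) * np.exp(-2 * (y - 1) * f)
    raise ValueError("unknown loss: " + loss)
```
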
Tree Models
Bias and Variance
Bias and Variance have traditionally been seen as a trade-off
Low Bias => High Variance
Low Variance => High Bias
This is true for a fixed model class, but modern algorithmic methods attempt to overcome it.
Random Forest and Gradient Boosting are both additive ensembles of trees, attacking the trade-off from opposite ends
Random Forest takes a low-bias, high-variance modelling procedure (deep trees) and tries to reduce the variance
GB takes a low-variance, high-bias modelling procedure (e.g. stumps/shallow trees) and tries to correct the bias
Also, in GB, as the number of boosting steps grows, bias correction plays a decreasing role while averaging kicks in, which often makes it surprisingly resistant to overfitting and variance increase (a kind of internal averaging)
Gradient Boosting packages
Fitting a boosted model
Loss function: very important. Align it to your end objective as closely as possible.
Hyperparameter tuning: quite important. There are many methods: heuristics, grid search, adaptive grid search, random search, Bayesian optimisation, …
Feature engineering: can be important. Very specific to the data and problem.
Complexity: max_depth, min_child_weight and gamma
Subsampling: subsample, colsample_bytree
Learning Progression: eta, num_round
Imbalanced data: scale_pos_weight, max_delta_step
From Owen Zhang's presentation: http://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions
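Grouped as above, the knobs map onto a parameter list like the following Python sketch (the values are placeholders for illustration, not tuned recommendations):

```python
# Illustrative xgboost parameters, grouped as on the slide.
# Values are placeholders, not recommendations.
params = {
    # Complexity
    "max_depth": 6,
    "min_child_weight": 10,
    "gamma": 1,
    # Subsampling
    "subsample": 0.8,
    "colsample_bytree": 0.66,
    # Learning progression (num_round is passed to the training call itself)
    "eta": 0.1,
    # Imbalanced data
    "scale_pos_weight": 5,
    "max_delta_step": 1,
}
```
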
The Higgs challenge
\[AMS=\sqrt{2\left((s+b+b_r)log\left(1+\frac{s}{b+b_r}\right)−s\right)}\]
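Here \(s\) and \(b\) are the weighted signal and background counts in the selection region and \(b_r\) is a regularisation constant (the challenge used \(b_r = 10\)). A direct Python transcription of the formula:

```python
import math

def ams(s, b, b_r=10.0):
    """Approximate Median Significance for weighted signal count s and
    weighted background count b, with regularisation term b_r."""
    return math.sqrt(2 * ((s + b + b_r) * math.log(1 + s / (b + b_r)) - s))
```

For small \(s/(b+b_r)\) this behaves like the familiar \(s/\sqrt{b}\) significance estimate.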
xgmat <- xgb.DMatrix(data, label = label, weight = weight, missing = -999.0)
param <- list("objective" = "binary:logitraw",
              "scale_pos_weight" = sumwneg / sumwpos,
              "bst:eta" = 0.1,
              "bst:max_depth" = 6,
              "eval_metric" = "auc",
              "eval_metric" = "ams@0.15",
              "silent" = 1,
              "nthread" = 16)
watchlist <- list("train" = xgmat)
nround <- 120
bst <- xgb.train(param, xgmat, nround, watchlist)
Parameter tuning with Caret
xgb_grid_3 <- expand.grid(
  nrounds = c(150, 250),
  eta = c(0.01, 0.1, 0.2),
  max_depth = c(4, 6),
  gamma = c(0, 1),
  colsample_bytree = c(0.66, 1),
  min_child_weight = c(5, 10, 20)
)
nrow(xgb_grid_3)
[1] 144
| nrounds | eta | max_depth | gamma | colsample_bytree | min_child_weight |
|---|---|---|---|---|---|
| 150 | 0.01 | 4 | 0 | 0.66 | 5 |
| 250 | 0.01 | 4 | 0 | 0.66 | 5 |
| 150 | 0.10 | 4 | 0 | 0.66 | 5 |
| 250 | 0.10 | 4 | 0 | 0.66 | 5 |
| 150 | 0.20 | 4 | 0 | 0.66 | 5 |
| 250 | 0.20 | 4 | 0 | 0.66 | 5 |
| … | … | … | … | … | … |

(144 rows in total: the full crossing of the six parameter vectors.)
xgb_trcontrol_3 <- trainControl(
  method = "cv",
  number = 10,
  verboseIter = TRUE,
  returnData = FALSE,
  returnResamp = "all",           # save losses across all models
  classProbs = TRUE,              # set to TRUE for AUC to be computed
  allowParallel = TRUE,
  summaryFunction = twoClassSummary
)
xgb_train_3 <- train(
  x = data,
  y = label,
  weights = weight,
  trControl = xgb_trcontrol_3,
  tuneGrid = xgb_grid_3,
  method = "xgbTree",
  metric = "ROC",
  scale_pos_weight = sumwneg / sumwpos,
  missing = NA
)
ROC was used to select the optimal model using the largest value. The final values used for the model were nrounds = 250, max_depth = 6, eta = 0.1, gamma = 1, colsample_bytree = 1 and min_child_weight = 20.
Useful xgboost helper functions
The text dump of a fitted model (e.g. from `xgb.dump`) shows every split with its gain and cover:

booster[0]
0:[f0<50.5] yes=1,no=2,missing=1,gain=19.1203,cover=10000
  1:[f0<27.5] yes=3,no=4,missing=3,gain=6.62227,cover=5000
    3:[f0<21.5] yes=7,no=8,missing=7,gain=0.0800078,cover=2700
      7:leaf=-0.150987,cover=2100
      8:leaf=-0.159073,cover=600
    4:[f1<70.5] yes=9,no=10,missing=9,gain=5.12141,cover=2300
      9:[f1<34.5] yes=15,no=16,missing=15,gain=14.039,cover=1610
        15:[f1<27.5] yes=19,no=20,missing=19,gain=0.174287,cover=782
          19:leaf=-0.151579,cover=621
          20:leaf=-0.169178,cover=161
        16:[f0<46.5] yes=21,no=22,missing=21,gain=2.27443,cover=828
          21:[f1<62.5] yes=27,no=28,missing=27,gain=1.94476,cover=684
            27:leaf=-0.229251,cover=532
            28:leaf=-0.185909,cover=152
          22:leaf=-0.174229,cover=144
      10:leaf=-0.152837,cover=690
  2:[f0<73.5] yes=5,no=6,missing=5,gain=6.72211,cover=5000
    5:[f1<70.5] yes=11,no=12,missing=11,gain=5.20528,cover=2300
      11:[f1<35.5] yes=17,no=18,missing=17,gain=14.2643,cover=1610
        17:[f1<28.5] yes=23,no=24,missing=23,gain=0.457774,cover=805
          23:leaf=-0.147592,cover=644
          24:leaf=-0.125756,cover=161
        18:[f0<54.5] yes=25,no=26,missing=25,gain=2.48841,cover=805
          25:[f0<52.5] yes=29,no=30,missing=29,gain=0.0744164,cover=140
            29:leaf=-0.134878,cover=70
            30:leaf=-0.110093,cover=70
          26:[f1<61.5] yes=31,no=32,missing=31,gain=2.41692,cover=665
            31:leaf=-0.0678413,cover=494
            32:leaf=-0.109706,cover=171
      12:leaf=-0.146728,cover=690
    6:[f0<79.5] yes=13,no=14,missing=13,gain=0.129898,cover=2700
      13:leaf=-0.140428,cover=600
      14:leaf=-0.14887,cover=2100
booster[1] …
http://xgboost.readthedocs.io/en/latest/parameter.html
https://github.com/dmlc/xgboost
Summing up